都会|可能会_###haohaohao###图神经网络之神器——PyTorchGeometric上手&实战

作者：Jerl | 来源：互联网 | 2023-10-16 18:30

篇首语：本文由编程笔记#小编为大家整理，主要介绍了###haohaohao###图神经网络之神器——PyTorchGeometric上手&实战相关的知识，希望对你有一定的参考价值。

篇首语：本文由编程笔记#小编为大家整理，主要介绍了###haohaohao###图神经网络之神器——PyTorch Geometric 上手 & 实战相关的知识，希望对你有一定的参考价值。

图神经网络&＃xff08;Graph Neural Networks, GNN&＃xff09;最近被视为在图研究等领域一种强有力的方法。跟传统的在欧式空间上的卷积操作类似&＃xff0c;GNNs通过对信息的传递&＃xff0c;转换和聚合实现特征的提取。这篇博客主要想分享下&＃xff0c;怎样在你的项目中简单快速地实现图神经网络。你将会了解到怎样用PyTorch Geometric 去构建一个图神经网络&＃xff0c;以及怎样用GNN去解决一个实际问题&＃xff08;Recsys Challenge 2015&＃xff09;。

我们将使用PyTorch 和 PyG&＃xff08;PyTorch Geometric Library&＃xff09;。PyG是一个基于PyTorch的用于处理不规则数据&＃xff08;比如图&＃xff09;的库&＃xff0c;或者说是一个用于在图等数据上快速实现表征学习的框架。它的运行速度很快&＃xff0c;训练模型速度可以达到DGL&＃xff08;Deep Graph Library &＃xff09;v0.2 的40倍&＃xff08;数据来自论文&＃xff09;。除了出色的运行速度外&＃xff0c;PyG中也集成了很多论文中提出的方法&＃xff08;GCN,SGC,GAT,SAGE等等&＃xff09;和常用数据集。因此对于复现论文来说也是相当方便。由于速度和方便的优势&＃xff0c;毫无疑问&＃xff0c;PyG是当前最流行和广泛使用的GNN库。让我们开始吧。

Requirments:

Python 3
PyTorch
PyTorch Geometric

PyG Basics

这部分将会带你了解PyG的基础知识。重要的是会涵盖torch_gemotric.data 和 torch_geometric.nn。 你将会了解到怎样将你的图数据导入你的神经网络模型&＃xff0c;以及怎样设计一个MessagePassing layer。这个也是GNN的核心。

Data

torch_geometric.data这个模块包含了一个叫Data的类。这个类允许你非常简单的构建你的图数据对象。你只需要确定两个东西&＃xff1a;

节点的属性/特征&＃xff08;the attributes/features associated with each node, node features&＃xff09;
邻接/边连接信息&＃xff08;the connectivity/adjacency of each node, edge index&＃xff09;

让我们用一个例子来说明一个写怎样创建一个Data对象。

在这个图里有4个节点&＃xff0c;V1,V2,V3,V4,每一个都带有一个2维的特征向量&＃xff0c;和一个标签y&＃xff0c;代表这个节点属于哪一类。

这两个东西可以用FloatTesonr来表示&＃xff1a;

x &＃61; torch.tensor([[2,1],[5,6],[3,7],[12,0]], dtype&＃61;torch.float) y &＃61; torch.tensor([0,1,0,1], dtype&＃61;torch.float)

图的节点连接信息要以COO格式进行存储。在COO格式中&＃xff0c;COO list 是一个2*E 维的list。第一个维度的节点是源节点(source nodes)&＃xff0c;第二个维度中是目标节点(target nodes)&＃xff0c;连接方式是由源节点指向目标节点。对于无向图来说&＃xff0c;存贮的source nodes 和 target node 是成对存在的。

方式1

edge_index &＃61; torch.tensor([[0,1,2,0,3], [1,0,1,3,2]],dtype&＃61;torch,long)

方式2

edge_index &＃61; torch.tensor([[0, 1], [1, 0], [2, 1], [0, 3] [2, 3]], dtype&＃61;torch.long)

第二种方法在使用时要调用contiguous()方法。

边索引的顺序跟Data对象无关&＃xff0c;或者说边的存储顺序并不重要&＃xff0c;因为这个edge_index只是用来计算邻接矩阵&＃xff08;Adjacency Matrix&＃xff09;。

把它们放在一起我们就可以创建一个Data了。

# 方法一 import torch from torch_geometric.data import Data x &＃61; torch.tensor([[2,1],[5,6],[3,7],[12,0]],dtype&＃61;torch.float) y &＃61; torch.tensor([[0,2,1,0,3],[3,1,0,1,2]],dtype&＃61;torch.long) edge_index &＃61; torch.tensor([[0,1,2,0,3], [1,0,1,3,2]],dtype&＃61;torch,long) data &＃61; Data(x&＃61;x,y&＃61;y,edge_index&＃61;edge_index) # 方法二 import torch from torch_geometric.data import Data x &＃61; torch.tensor([[2,1],[5,6],[3,7],[12,0]],dtype&＃61;torch.float) y &＃61; torch.tensor([[0,2,1,0,3],[3,1,0,1,2]],dtype&＃61;torch.long) edge_index &＃61; torch.tensor([[0, 1], [1, 0], [2, 1], [0, 3] [2, 3]], dtype&＃61;torch.long) data &＃61; Data(x&＃61;x,y&＃61;y,edge_index&＃61;edge_index.contiguous())

这样我们就创建了一个新的Data。其中x,y,edge_index 是最基本的键值&＃xff08;key&＃xff09;。你也可以添加自己的key。有了这个data&＃xff0c;你可以在程序中非常方便的调用处理你的数据。

Dataset

数据集Dataset的创建不像Data一样简单直接了。Dataset有点像torchvision&＃xff0c;它有着自己的规则。

PyG提供两种不同的数据集类&＃xff1a;

InMemoryDataset
Dataset

要创建一个InMemoryDataset&＃xff0c;你必须实现一个函数

Raw_file_names()

它返回一个包含没有处理的数据的名字的list。如果你只有一个文件&＃xff0c;那么它返回的list将只包含一个元素。事实上&＃xff0c;你可以返回一个空list&＃xff0c;然后确定你的文件在后面的函数process()中。

Processed_file_names()

很像上一个函数&＃xff0c;它返回一个包含所有处理过的数据的list。在调用process()这个函数后&＃xff0c;通常返回的list只有一个元素&＃xff0c;它只保存已经处理过的数据的名字。

Download()

这个函数下载数据到你正在工作的目录中&＃xff0c;你可以在self.raw_dir中指定。如果你不需要下载数据&＃xff0c;你可以在这函数中简单的写一个

pass

就好。

Process()

这是Dataset中最重要的函数。你需要整合你的数据成一个包含data的list。然后调用 self.collate()去计算将用DataLodadr的片段。下面这个例子来自PyG官方文档。

import torch from torch_geometric.data import InMemoryDataset class MyOwnDataset(InMemoryDataset): def __init__(self, root, transform&＃61;None, pre_transform&＃61;None): super(MyOwnDataset, self).__init__(root, transform, pre_transform) self.data, self.slices &＃61; torch.load(self.processed_paths[0]) &＃64;property def raw_file_names(self): return [&＃39;some_file_1&＃39;, &＃39;some_file_2&＃39;, ...] &＃64;property def processed_file_names(self): return [&＃39;data.pt&＃39;] def download(self): # Download to &＃96;self.raw_dir&＃96;. def process(self): # Read data into huge &＃96;Data&＃96; list. data_list &＃61; [...] if self.pre_filter is not None: data_list [data for data in data_list if self.pre_filter(data)] if self.pre_transform is not None: data_list &＃61; [self.pre_transform(data) for data in data_list] data, slices &＃61; self.collate(data_list) torch.save((data, slices), self.processed_paths[0])

我将会在后面介绍怎样从RecSys 2015 提供的数据构建一个用于PyG的一般数据集。

DataLoader

DataLoader 这个类允许你通过batch的方式feed数据。创建一个DotaLoader实例&＃xff0c;可以简单的指定数据集和你期望的batch size。

loader &＃61; DataLoader(dataset, batch_size&＃61;512, shuffle&＃61;True)

DataLoader的每一次迭代都会产生一个Batch对象。它非常像Data对象。但是带有一个‘batch’属性。它指明了了对应图上的节点连接关系。因为DataLoader聚合来自不同图的的batch的x,y 和edge_index&＃xff0c;所以GNN模型需要batch信息去知道那个节点属于哪一图。

for batch in loader: batch >>> Batch(x&＃61;[1024, 21], edge_index&＃61;[2, 1568], y&＃61;[512], batch&＃61;[1024])

MessagePassing

这个GNN的本质&＃xff0c;它描述了节点的embeddings是怎样被学习到的。

作者已经将MessagePassing这个接口写好&＃xff0c;以便于大家快速实现自己的想法。如果想使用这个框架&＃xff0c;就要重新定义三个方法&＃xff1a;

message
update
aggregation scheme

在实现message的时候&＃xff0c;节点特征会自动map到各自的source and target nodes。 aggregation scheme 只需要设置参数就好&＃xff0c;sum, mean or max。

对于一个简单的GCN来说&＃xff0c;我们只需要按照以下步骤&＃xff0c;就可以快速实现一个GCN&＃xff1a;

添加self-loop 到邻接矩阵&＃xff08;Adjacency Matrix&＃xff09;。
节点特征的线性变换。
标准化节点特征。
聚合邻接节点信息。
得到节点新的embeddings

步骤1 和 2 需要在message passing 前被计算好。 3 - 5 可以torch_geometric.nn.MessagePassing 类。

添加self-loop的目的是让featrue在聚合的过程中加入当前节点自己的feature&＃xff0c;没有self-loop聚合的就只有邻居节点的信息。

Example 1 下面是官方文档的一个GCN例子&＃xff0c;其中注释中的Step 1-5对应上文的步骤1-5.

import torch from torch_geometric.nn import MessagePassing from torch_geometric.utils import add_self_loops, degree class GCNConv(MessagePassing): def __init__(self, in_channels, out_channels): super(GCNConv, self).__init__(aggr&＃61;&＃39;add&＃39;) # "Add" aggregation. self.lin &＃61; torch.nn.Linear(in_channels, out_channels) def forward(self, x, edge_index): # x has shape [N, in_channels] # edge_index has shape [2, E] # Step 1: Add self-loops to the adjacency matrix. edge_index, _ &＃61; add_self_loops(edge_index, num_nodes&＃61;x.size(0)) # Step 2: Linearly transform node feature matrix. x &＃61; self.lin(x) # Step 3-5: Start propagating messages. return self.propagate(edge_index, size&＃61;(x.size(0), x.size(0)), x&＃61;x) def message(self, x_j, edge_index, size): # x_j has shape [E, out_channels] # Step 3: Normalize node features. row, col &＃61; edge_index deg &＃61; degree(row, size[0], dtype&＃61;x_j.dtype) deg_inv_sqrt &＃61; deg.pow(-0.5) norm &＃61; deg_inv_sqrt[row] * deg_inv_sqrt[col] return norm.view(-1, 1) * x_j def update(self, aggr_out): # aggr_out has shape [N, out_channels] # Step 5: Return new node embeddings. return aggr_out

所有的逻辑代码都在forward()里面&＃xff0c;当我们调用propagate()函数之后&＃xff0c;它将会在内部调用message()和update()。

Example 2 下面是一个SAGE的例子

import torch from torch.nn import Sequential as Seq, Linear, ReLU from torch_geometric.nn import MessagePassing from torch_geometric.utils import remove_self_loops, add_self_loops class SAGEConv(MessagePassing): def __init__(self, in_channels, out_channels): super(SAGEConv, self).__init__(aggr&＃61;&＃39;max&＃39;) # "Max" aggregation. self.lin &＃61; torch.nn.Linear(in_channels, out_channels) self.act &＃61; torch.nn.ReLU() self.update_lin &＃61; torch.nn.Linear(in_channels &＃43; out_channels, in_channels, bias&＃61;False) self.update_act &＃61; torch.nn.ReLU() def forward(self, x, edge_index): # x has shape [N, in_channels] # edge_index has shape [2, E] edge_index, _ &＃61; remove_self_loops(edge_index) edge_index, _ &＃61; add_self_loops(edge_index, num_nodes&＃61;x.size(0)) return self.propagate(edge_index, size&＃61;(x.size(0), x.size(0)), x&＃61;x) def message(self, x_j): # x_j has shape [E, in_channels] x_j &＃61; self.lin(x_j) x_j &＃61; self.act(x_j) return x_j def update(self, aggr_out, x): # aggr_out has shape [N, out_channels] new_embedding &＃61; torch.cat([aggr_out, x], dim&＃61;1) new_embedding &＃61; self.update_lin(new_embedding) new_embedding &＃61; self.update_act(new_embedding) return new_embedding

上面的部分主要介绍了怎样把数据编程Data&＃xff0c;以及通过MessagPassing来实现自己的想法&＃xff0c;也就是怎样生成新的embeddings。至于怎样训练模型可以看下面的内容&＃xff0c;以及参考官方的示例&＃xff08;https://github.com/rusty1s/pytorch_geometric/tree/master/examples&＃xff09;。

A Real-World Example —— RecSys Challenge 2015

RecSys Challenge 2015 是一个推荐算法竞赛。参与者被要求完成以下两个任务&＃xff1a;

通过一个点击序列预测是否会产生一个购买行为。
预测哪个产品将要被购买。

首先&＃xff0c;我们可以在官网&＃xff08;https://recsys.acm.org/recsys15/challenge/&＃xff09;下载数据并且构建成一个数据集。然后开始做第一个任务&＃xff0c;因为它比较简单。竞赛提供了两个主要的数据集。 yoochoose-clicks.dat 和 yoochoose-buys.dat&＃xff0c;分别各自包含点击事件和购买事件。

Preprocessing

在下载完数据之后&＃xff0c;我们需要对它进行预处理&＃xff0c;这样它可以被fed进我们的模型。

from sklearn.preprocessing import LabelEncoder df &＃61; pd.read_csv(&＃39;../input/yoochoose-click.dat&＃39;, header&＃61;None) df.columns&＃61;[&＃39;session_id&＃39;,&＃39;timestamp&＃39;,&＃39;item_id&＃39;,&＃39;category&＃39;] buy_df &＃61; pd.read_csv(&＃39;../input/yoochoose-buys.dat&＃39;, header&＃61;None) buy_df.columns&＃61;[&＃39;session_id&＃39;,&＃39;timestamp&＃39;,&＃39;item_id&＃39;,&＃39;price&＃39;,&＃39;quantity&＃39;] item_encoder &＃61; LabelEncoder() df[&＃39;item_id&＃39;] &＃61; item_encoder.fit_transform(df.item_id) df.head()

因为数据集很大。我们用子图以方便演示。

#randomly sample a couple of them sampled_session_id &＃61; np.random.choice(df.session_id.unique(), 1000000, replace&＃61;False) df &＃61; df.loc[df.session_id.isin(sampled_session_id)] df.nunique()

为了确定一个ground truth。对于一个给定的session&＃xff0c;是否存在一个购买事件。我们简单的检查是否一个 session_id 在 yoochoose-clicks.dat 也出现在 yoochoose-buys.dat 中。

df[&＃39;label&＃39;] &＃61; df.session_id.isin(buy_df.session_id) df.head()

数据集的构建 Dataset Construction

在预处理步骤之后&＃xff0c;就可以将数据转换为Dataset对象了。在这里&＃xff0c;我们将session中的每个item都视为一个节点&＃xff0c;因此同一session中的所有items都形成一个图。为了构建数据集&＃xff0c;我们通过session_id对预处理的数据进行分组&＃xff0c;并在这些组上进行迭代。在每次迭代中&＃xff0c;对每个图中节点索引应0开始。

import torch from torch_geometric.data import InMemoryDataset from tqdm import tqdm class YooChooseBinaryDataset(InMemoryDataset): def __init__(self, root, transform&＃61;None, pre_transform&＃61;None): super(YooChooseBinaryDataset, self).__init__(root, transform, pre_transform) self.data, self.slices &＃61; torch.load(self.processed_paths[0]) &＃64;property def raw_file_names(self): return [] &＃64;property def processed_file_names(self): return [&＃39;../input/yoochoose_click_binary_1M_sess.dataset&＃39;] def download(self): pass def process(self): data_list &＃61; [] # process by session_id grouped &＃61; df.groupby(&＃39;session_id&＃39;) for session_id, group in tqdm(grouped): sess_item_id &＃61; LabelEncoder().fit_transform(group.item_id) group &＃61; group.reset_index(drop&＃61;True) group[&＃39;sess_item_id&＃39;] &＃61; sess_item_id node_features &＃61; group.loc[group.session_id&＃61;&＃61;session_id,[&＃39;sess_item_id&＃39;,&＃39;item_id&＃39;]].sort_values(&＃39;sess_item_id&＃39;).item_id.drop_duplicates().values node_features &＃61; torch.LongTensor(node_features).unsqueeze(1) target_nodes &＃61; group.sess_item_id.values[1:] source_nodes &＃61; group.sess_item_id.values[:-1] edge_index &＃61; torch.tensor([source_nodes, target_nodes], dtype&＃61;torch.long) x &＃61; node_features y &＃61; torch.FloatTensor([group.label.values[0]]) data &＃61; Data(x&＃61;x, edge_index&＃61;edge_index, y&＃61;y) data_list.append(data) data, slices &＃61; self.collate(data_list) torch.save((data, slices), self.processed_paths[0])

在构建好数据集&＃xff0c;我们使用shuffle()方法确保数据集被随机打乱。然后把数据集分成 3份&＃xff0c;分别用作 training validation and testing.

dataset &＃61; dataset.shuffle() train_dataset &＃61; dataset[:800000] val_dataset &＃61; dataset[800000:900000] test_dataset &＃61; dataset[900000:] len(train_dataset), len(val_dataset), len(test_dataset)

Build a Graph Neural Networks

以下GNN引用了PyG官方Github存储库中的示例之一&＃xff0c;并使用上面的example 2的SAGEConv层(不同于官方文档中的SAGEConv())。此外&＃xff0c;还对输出层进行了修改以与binary classification设置匹配。

embed_dim &＃61; 128 from torch_geometric.nn import TopKPooling from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp import torch.nn.functional as F class Net(torch.nn.Module): def __init__(self): super(Net, self).__init__() self.conv1 &＃61; SAGEConv(embed_dim, 128) self.pool1 &＃61; TopKPooling(128, ratio&＃61;0.8) self.conv2 &＃61; SAGEConv(128, 128) self.pool2 &＃61; TopKPooling(128, ratio&＃61;0.8) self.conv3 &＃61; SAGEConv(128, 128) self.pool3 &＃61; TopKPooling(128, ratio&＃61;0.8) self.item_embedding &＃61; torch.nn.Embedding(num_embeddings&＃61;df.item_id.max() &＃43;1, embedding_dim&＃61;embed_dim) self.lin1 &＃61; torch.nn.Linear(256, 128) self.lin2 &＃61; torch.nn.Linear(128, 64) self.lin3 &＃61; torch.nn.Linear(64, 1) self.bn1 &＃61; torch.nn.BatchNorm1d(128) self.bn2 &＃61; torch.nn.BatchNorm1d(64) self.act1 &＃61; torch.nn.ReLU() self.act2 &＃61; torch.nn.ReLU() def forward(self, data): x, edge_index, batch &＃61; data.x, data.edge_index, data.batch x &＃61; self.item_embedding(x) x &＃61; x.squeeze(1) x &＃61; F.relu(self.conv1(x, edge_index)) x, edge_index, _, batch, _ &＃61; self.pool1(x, edge_index, None, batch) x1 &＃61; torch.cat([gmp(x, batch), gap(x, batch)], dim&＃61;1) x &＃61; F.relu(self.conv2(x, edge_index)) x, edge_index, _, batch, _ &＃61; self.pool2(x, edge_index, None, batch) x2 &＃61; torch.cat([gmp(x, batch), gap(x, batch)], dim&＃61;1) x &＃61; F.relu(self.conv3(x, edge_index)) x, edge_index, _, batch, _ &＃61; self.pool3(x, edge_index, None, batch) x3 &＃61; torch.cat([gmp(x, batch), gap(x, batch)], dim&＃61;1) x &＃61; x1 &＃43; x2 &＃43; x3 x &＃61; self.lin1(x) x &＃61; self.act1(x) x &＃61; self.lin2(x) x &＃61; self.act2(x) x &＃61; F.dropout(x, p&＃61;0.5, training&＃61;self.training) x &＃61; torch.sigmoid(self.lin3(x)).squeeze(1) return x

Training

训练自定义GNN非常容易&＃xff0c;只需迭代从训练集构造的DataLoader&＃xff0c;然后反向传播损失函数。在这里&＃xff0c;使用Adam作为优化器&＃xff0c;将学习速率设置为0.005&＃xff0c;将Binary Cross Entropy作为损失函数。

def train(): model.train() loss_all &＃61; 0 for data in train_loader: data &＃61; data.to(device) optimizer.zero_grad() output &＃61; model(data) label &＃61; data.y.to(device) loss &＃61; crit(output, label) loss.backward() loss_all &＃43;&＃61; data.num_graphs * loss.item() optimizer.step() return loss_all / len(train_dataset) device &＃61; torch.device(&＃39;cuda&＃39;) model &＃61; Net().to(device) optimizer &＃61; torch.optim.Adam(model.parameters(), lr&＃61;0.005) crit &＃61; torch.nn.BCELoss() train_loader &＃61; DataLoader(train_dataset, batch_size&＃61;batch_size) for epoch in range(num_epochs): train()

Validation

标签存在大量的negative标签&＃xff0c;数据是高度不平衡的&＃xff0c;因为大多数会话之后都没有任何购买事件。换句话说&＃xff0c;一个愚蠢的模型可能会预测所有的情况为negative&＃xff0c;从而使准确率达到90&＃xff05;以上。因此&＃xff0c;代替准确度&＃xff0c;AUC是完成此任务的更好指标&＃xff0c;因为它只在乎阳性实例的得分是否高于阴性实例。我们使用来自Sklearn的现成AUC计算功能。

def evaluate(loader): model.eval() predictions &＃61; [] labels &＃61; [] with torch.no_grad(): for data in loader: data &＃61; data.to(device) pred &＃61; model(data).detach().cpu().numpy() label &＃61; data.y.detach().cpu().numpy() predictions.append(pred) labels.append(label)

Result

以下是对模型进行1个epoch的训练&＃xff0c;并打印相关参数&＃xff1a;

for epoch in range(1): loss &＃61; train() train_acc &＃61; evaluate(train_loader) val_acc &＃61; evaluate(val_loader) test_acc &＃61; evaluate(test_loader) print(&＃39;Epoch: :03d, Loss: :.5f, Train Auc: :.5f, Val Auc: :.5f, Test Auc: :.5f&＃39;. format(epoch, loss, train_acc, val_acc, test_acc))

Conclusion

到此&＃xff0c;你已经学会了PyG的基本用法&＃xff0c;包括数据集的构建&＃xff0c;定制GNN网络&＃xff0c;训练GNN模型。以上代码以及主要内容均来自于官方文档以及https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 这个博客。希望对你有所帮助。更多PyG的介绍和example可以查询官方文档和官方的Github库。

最后&＃xff0c;

Sharing is carrying.

参考链接&＃xff1a;

Fast Graph Representation Learning with PyTorch Geometricrlgm.github.io

PyTorch Geometric Documentationpytorch-geometric.readthedocs.io

Hands-on Graph Neural Networks with PyTorch & PyTorch Geometrictowardsdatascience.com